Data access system

ABSTRACT

A method of automatically creating a database on the basis of a set of category headings uses a set of keywords provided for each category heading. The keywords are used by a processing platform to define searches to be carried out on a plurality of search engines connected to the processing platform via the Internet. The search results are processed by the processing platform to identify the URLs embedded in the search results. The URLs are then used to retrieve the pages to which they refer from remote data sources in the Internet. The processing platform then filters and scores the pages to determine which pages are the most relevant to the original categories. Internet location information for the most relevant pages is stored in the database.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to improved data access efficiency. In particular, the invention finds application in the area of tailored database creation.

2. Related Art

An example of an environment in which data is stored in a highly distributed fashion is the World Wide Web (WWW). The WWW is a vast, unstructured collection of information stored on many different servers around the Internet. Latest estimates put the number of individual pages of information at over 30 million and the number of servers at over 2,250,000.

Navigating around this quantity of data is particularly difficult without some assistance. Aids for navigation such as Indexes and Directories have been created for the WWW and represent the two main navigation approaches for the WWW.

In the case of Indexes, so-called search engines, for example AltaVista (http://altavista.digital.com), retrieve as many WWW pages as possible and index the words in each page. Typically, a search engine runs processes, known as robots or spiders, which exhaustively follow all hyper-links, embedded in retrieved pages, in selected areas of the WWW. A large Internet search engine may have a stored index of many millions of pages. Users are then able to enter a keyword, which is compared to the index entries, and receive a list of pages that contain that required keyword. This is a simple method of finding information which is, however, limited in effectiveness by how comprehensive and accurate the keyword indexes are.

A Directory, for example Yahoo, comprises a hierarchy of categories related to a particular topic. The hierarchy is defined by the creator of the Directory and has entries, under the lowest level categories, typically added by the Directory supervisor(s) or sometimes by users. It is easier to find information in a Directory than by using a search engine, as in a Directory the choices are constrained by the known topic area categories. However, the effectiveness of Directory-type WWW navigation is limited by the rigid categorisation scheme. This leads to two disadvantages: firstly, the categorisation of a particular heading may need to change, leading to extensive manual re-working of the directory, and secondly, the scheme may not be suitable or intuitive for some users.

SUMMARY OF THE INVENTION

In an article entitled “Navigating with a Web Compass”, R. Baldazo, Byte, March 1996, McGraw-Hill, USA, volume 21, number 3, pages 97-98, there is described a search tool which can cause search engines to perform a search on the basis of specific search terms.

According to one aspect of the present invention, there is provided an apparatus for populating a destination database, said apparatus comprising:

a destination database;

means for connecting the apparatus to a distributed source database;

a memory area for containing a group of keywords related to a predetermined subject category;

means for controlling at least one search engine associated with said distributed source database, on the basis of a group of keywords contained in said memory area, to provide search results including information relating to the location of documents containing said keywords, said documents being stored in said distributed source database;

means for scoring each of the documents identified in search results provided by said at least one search engine on the basis of the respective contents thereof in accordance with predetermined criteria;

means for selecting at least some of the documents scored by said scoring means, on the basis of their respective scores; and

means for storing, for each document selected by said selecting means, in said destination database information relating to the location of the document in said distributed source database.

The invention overcomes problems associated with the known search methods by providing a practical way of populating a destination database from a much larger distributed database.

For the present purposes, and unless otherwise stated, the terms “pages” and “documents” both refer to a compilation of data stored at a single location, for example, in the WWW and are therefore interchangeable.

One example where embodiments of the invention might be useful is in a school environment in which a teacher wishes to limit the amount and type of information available to students. Once defined, the data in the database can be freely “browsed” by the students without the fear of finding inappropriate or irrelevant information. Another example is a company-wide environment where only commercial information on certain subjects, and not academic information, is required.

The memory area may be arranged to contain a set of groups of keywords, each group of keywords being related to a respective predetermined subject category. The categories may be arranged hierarchically with the most generic category at the highest level of a directory, and with increasingly more specific categories branching out at lower levels, similar, for example, to the Directories described above.

In preferred embodiments, a further step of reducing the number of retrieved document file references is included before the data is finally stored in the destination database. This allows removal of references to inappropriate or irrelevant documents. For example, if the required database has a category of “furniture” and a keyword of “tables”, it would be sensible and desirable to remove references to documents relating to mathematical “tables”.

According to a second aspect of this invention, there is provided an apparatus for populating a destination database, said apparatus comprising:

a destination database;

a memory area for containing a group of keywords related to a predetermined subject category; and

a processing platform which can access said destination database and said memory area, which is connectable to a distributed source database, and which is arranged to:

control at least one search engine associated with said distributed database, on the basis of a group of keywords contained in said memory area, to provide search results including information relating to the location of documents containing said keywords, said documents being stored in said distributed source database;

score each of the documents identified in said search results on the basis of the respective contents thereof in accordance with predetermined criteria;

select at least some of the documents on the basis of their respective scores; and

store, for each selected document, in said destination database information relating to the location of the document in said distributed source database.

According to a third aspect of this invention, there is provided an apparatus for populating a destination database, said apparatus comprising:

a) means for connecting the apparatus to a distributed source database;

b) a first memory area for storing a group of keywords associated with a pre-determined subject category;

c) means for reading a keyword from the first memory area and transmitting said keyword to search means, said search means having access to the distributed source database;

d) means for receiving search results from said search means and storing said results in a second memory area, said results including information relating to the location of documents stored in said source database containing said keyword;

e) means for identifying and storing said location information in a third memory area;

f) means for reading location information from the third memory area and transmitting a request to the source database to return a copy of a document associated with selected location information to the apparatus;

g) means for receiving and storing said copy of said document in a fourth memory area;

h) means for accessing and scoring each of the documents stored in the fourth memory area on the basis of the respective contents thereof in accordance with pre-determined criteria; and

i) means for selecting at least some of said documents on the basis of the respective scores and, for each selected document, storing in a database information relating to the location of the document in the source database.

According to a fourth aspect of this invention, there is provided a method of populating a destination database, said method comprising the steps of:

controlling at least one search engine associated with a distributed source database, on the basis of a group of keywords related to a predetermined subject category, to provide search results including information relating to the location of documents containing said keywords, said documents being stored in said distributed source database;

scoring each of the documents identified in said search results on the basis of the respective contents thereof in accordance with predetermined criteria;

selecting at least some of the documents on the basis of their respective scores; and

storing, for each selected document, in said destination database information relating to the location of the document in said distributed source database.

BRIEF DESCRIPTION OF THE DRAWINGS

An embodiment of the present invention will now be described, by way of example only, with reference to the accompanying drawings, in which:

FIG. 1 is a diagram showing a system for creating a database and which embodies the present invention;

FIG. 2 is a diagram of the main data storage areas used in the system of FIG. 1;

FIG. 3 is a flow diagram illustrating a method for building the database;

FIG. 4 is a diagram representing how the data is arranged in the database;

FIG. 5 is a flow diagram illustrating how documents are retrieved from distributed data sources when building the database;

FIG. 6 is a flow diagram illustrating how search results are processed in the system of FIG. 1; and

FIG. 7 is a flow diagram illustrating how documents are scored in the system of FIG. 1.

DETAILED DESCRIPTION OF EXEMPLARY EMBODIMENTS

In FIG. 1, there is shown a system suitable for building and populating a destination database in accordance with an embodiment of the present invention.

In this system, a processing platform (PP) 100 controls access to a destination database 112, held in a secondary storage device 110 (for example a hard disk drive) connected to the PP 100. The destination database 112 can be located separately and remotely from PP 100. When populated, the database 112 comprises compiled data stored in data tables 115. The PP 100 is also connected, via the Internet 120, to a plurality of search engines 130 a, 130 b and 130 c (only three of which are illustrated for clarity). The search engines 130 a, 130 b and 130 c are arranged to search information servers connected to the Internet. These information servers form a distributed source database.

The PP 100 in this embodiment is a UNIX (TM) based computing platform running appropriate software. The PP 100 can be a server in a client/server environment. For example, it might be accessible to users across a local area network in an office or school via personal computers (PCs) 140 a, 140 b and 140 c. Each PC 140 a, 140 b and 140 c is able to access the required data in the database 112 under the control of appropriate controlling software running on the PC and communicating with the PP 100. The software on the PCs can be permanently stored in local memory (not shown) or can be retrieved from the PP 100 when required and executed. The software, when executed, presents a graphical user interface (GUI) to the user of the type commonly associated with known Directory search software such as Yahoo. The details of the access software need, thus, not be discussed in any further detail.

FIG. 2 illustrates the main memory areas which are accessed and/or created during the operation of the present embodiment. In this case the memory used is secondary storage, for example a hard disk, although main memory, for example RAM, is typically used for manipulating the data before it is stored, or restored, in the secondary storage. In the present embodiment, the memory areas share the secondary storage device 110 with the database 112. The memory areas are for: a category list 200, a keyword list 205, a URL store 210, a document store 215, a report store 220, a search engine list 225, search engine syntaxes 230, search result files 235 and an initial list of URLs 240. Of these areas, the data in the category list 200, the keyword list 205, the search engine list 225 and the search engine syntaxes 230 is defined before the operation of the present embodiment. The remaining areas are used in operation.

FIG. 3 in combination with the following description illustrates one method in accordance with the present invention for building and populating the database 112.

In step 300, a category list is provided in memory area 200 comprising headings and categories for inclusion in the database 112. The headings and categories provide a 1:1 mapping with those which will be user-selectable in the completed database. Next, in step 302, a list of keywords is generated for each of the categories for inclusion in the category list. The headings, categories and keywords are also provided with associated information which dictates exactly how they are intended to relate to one another in the database. The structure of the category list is described in more detail below. The keywords are provided for the purpose of building the database but do not, as such, form part of the ultimate database structure.

Once provided for each of the categories, the keywords are combined into a single keyword list, in step 304, and stored in memory area 205.

In a step 306, the keywords are passed in turn to one or more of the search engines 130 a, 130 b and 130 c for keyword searching. In a conventional manner, each search engine produces search results for each keyword. For each keyword/search engine combination, the search results are stored in an individual text file in memory area 235.

Still in step 306, the URLs (Universal Resource Locators) are extracted from the search results held in each individual text file stored in memory area 235. As a result, there is obtained an individual list of URLs for the complete set of search results for each search engine. The URLs point to the candidate pages for the keywords found by the search engines. The URL lists are stored in memory area 240.

In a step 308, any errors resulting from the searches are recorded in memory area 220 in an error log for future analysis.

The process of step 306 for initiating searching by sending keywords to the search engines, retrieving and storing the results and processing the results to obtain the URL lists is described in more detail below with reference to FIGS. 5 and 6.

The URL lists produced by the search engines are then combined in step 312 to form a single list of URLs arranged by keyword. This single list is stored in memory area 210. This step includes removing duplicate URLs where more than one search engine has raised the same URL for the same word. Obviously, this step need only be carried out if more than one search engine is used. The result of this is effectively a list of URLs for each keyword.
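By way of illustration only, the combining step 312 might be sketched in Python as follows. The function name `merge_url_lists` and the nested-dictionary layout of the per-engine lists are assumptions made for this sketch and are not taken from the embodiment.

```python
def merge_url_lists(per_engine):
    """Combine per-engine URL lists into a single list per keyword.

    per_engine maps a (hypothetical) engine name to a dict of
    keyword -> list of URLs. Duplicate URLs raised by more than one
    engine for the same keyword are kept only once, in first-seen order.
    """
    merged = {}
    for results in per_engine.values():
        for keyword, urls in results.items():
            combined = merged.setdefault(keyword, [])
            for url in urls:
                if url not in combined:
                    combined.append(url)
    return merged
```

With only one search engine the function simply returns that engine's lists, which matches the remark that de-duplication is only needed when several engines are used.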

Then, in step 316, the list of URLs (which as mentioned above is arranged by keyword) is cross-referenced back with the original description of categories and keywords, generated in step 302, to identify those URLs which are candidates for each category. The URLs for each keyword in each category are then filtered, in step 320, to remove pages which are not to be processed. Pages which are commonly removed at this point include non-http references and other non-promising sites such as foreign language sites. This step may include checking each URL against a black (or indeed a white) list of sites, where a black-listed site would definitely not provide appropriate or relevant information and a white-listed site may be known to provide appropriate, good quality information. It will be appreciated that any suitable filtering policy could be employed at this point to create a directory of the desired type.
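One possible filtering policy for step 320 is sketched below in Python. It covers only the two checks named above (non-http references and black-listed sites); the function name `filter_urls`, the host-based black list and the example host names in the usage are assumptions, and white-list handling and language detection are deliberately left out.

```python
from urllib.parse import urlparse

def filter_urls(urls, blacklist=frozenset()):
    """Split URLs into (kept, rejected) under a simple example policy.

    A URL is rejected if its scheme is not http or if its host name
    appears in the (hypothetical) blacklist of host names.
    """
    kept, rejected = [], []
    for url in urls:
        parts = urlparse(url)
        host = (parts.hostname or "").lower()
        if parts.scheme != "http" or host in blacklist:
            rejected.append(url)   # candidates for the report of step 322
        else:
            kept.append(url)
    return kept, rejected
```

Returning the rejected URLs alongside the kept ones makes it straightforward to write them to a report file so that the policy can be checked periodically.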

The details of filtered-out URLs are reported in step 322 to a text file to allow periodic checks to be carried out to ensure that good information is not being rejected. Statistics can be generated, in step 326, to help improve the filtering process (step 320) in the future.

In step 330, each of the URLs for each category, which remains after the filtering step 320, is used to retrieve the respective WWW page. The retrieved pages or documents are stored in memory area 215. In step 338, each retrieved page is scored against the category in which it is proposed. The scoring process, which in effect determines the relevance of a retrieved page and accordingly whether it should be included in the database, is described in more detail with reference to FIG. 7.

Once the pages have been scored, in a second filtering step 342, a cut-off point is introduced to indicate which pages, having scores above the cut-off point, are to go into the database 112. The cut-off point can be a fixed value indicative of, for example, a measure of the ‘relevance’ of a page (determined, for example, by the number of occurrences of the keyword in the page), to limit the maximum number of pages in each category. Alternatively, the cut-off point can be determined in some other appropriate manner depending on the type of search results achieved for particular categories.
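The second filtering step 342 might, for example, be sketched as follows. The function name `select_pages` and its two optional parameters (a fixed score cut-off and a maximum page count per category) are assumptions chosen to reflect the two kinds of limit mentioned above.

```python
def select_pages(scored, cutoff=None, max_pages=None):
    """Pick the pages of one category that go into the database.

    scored is a list of (url, score) pairs. Pages are ranked by
    descending score; if cutoff is given, only scores strictly above it
    survive; if max_pages is given, at most that many pages are kept.
    """
    ranked = sorted(scored, key=lambda page: page[1], reverse=True)
    if cutoff is not None:
        ranked = [page for page in ranked if page[1] > cutoff]
    if max_pages is not None:
        ranked = ranked[:max_pages]
    return ranked
```

Either limit can be used alone or both together, depending on the type of search results obtained for a particular category.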

The output of the second filter step 342 is arranged, in step 346, into a format that can be fed into the database 112 using conventional database loading routines. One suitable form of database for database 112 is a relational database, such as Oracle (TM).

A suitable logical data arrangement for the database 112 is illustrated in FIG. 4. The available headings for the database 112 are stored in a first table 400. Each heading (HEAD1, HEAD2, etc) includes a reference to a category table 410 which lists the categories available to the user under that heading. Each category has two references: a first reference to a URL table 420 and a second reference to a title and description table 430. The title and description tables 430 hold the title and a brief summary of each page (having scores above a threshold, or the top n scores) which a user can access. The URL tables mirror the title and description tables by holding the URL for each accessible page. The URLs are used to retrieve the pages where required.

Once the data from step 346 has been arranged into a form suitable for loading into a database such as that described above, in step 360, the loading routines pass the data to the database 112 to be loaded into the database tables which have previously been created. Thus, the database 112 is populated in step 360. At this point, the database 112 is ready for use.

Once stored in the database 112, the data is accessed in a conventional manner using standard database scripts. The data is then presented to a user via a “front-end” user interface which can be made, for example, to resemble conventional search engines or Directories such as Yahoo. Database script and interface development uses conventional techniques which are not within the scope of the present description and will not be described in any more detail. For further information, the reader is referred to texts such as the Oracle (TM) users' guides.

An extract of an exemplary category list is reproduced below for reference to be used in combination with the subsequent description. The category list reflects the categories required in database 112.

(ballet:1:3, conga:2:1, samba:3:1, tango:1:1, waltz:2:1, modern:0:1, choreography:1:3, dance:1:5, rhumba:1:1, country dance:1:1, minuette:0:1, flamenco:0:1, cossack:0:1, disco:0:1)

=>films %%% hobbies & sports:cinema

(cinema:2:5, movie:1:5, flicks:2:1, movie-theatre:1:3, blockbuster:0:1, film:1:5, arthouse:0:1, hollywood:0:1, oscar:0:1, actor:0:1, actress:0:1, sound track:0:1, video:0:1)

=>photography %%% hobbies & sports:cameras & photos

(camera:1:5, photo:1:5, slide:6:1, photograph:1:5, flash:0:1, kodak:0:1, olympus:0:3, pentax:0:3)

>>>>>hobbies & sports

=>cameras & photos

(camera:1:5, photo:1:5, slide:6:1, photograph:1:5)

=>cinema

(cinema:2:5, movie:1:5, flicks:2:1, movie-theatre:1:3, blockbuster:0:1, film:1:5, arthouse:0:1)

=>gardening

(flower:1:3, plant:2:3, garden:3:5, tree:1:3, fruit:1:1, vegetable:2:1)

The category list is provided as a text file in a pre-determined format, similar to that illustrated above, which represents how keywords inter-relate. The text file is used by the system when database building commences.

In this example, there is a “heading” level above the category level. For convenience, only two headings (denoted by the syntax “>>>>>”) are shown: “arts & entertainment” and “hobbies & sports”. These headings, which in practice form part of a longer list of headings (which for convenience have not been illustrated) and which are of interest to a pre-determined group of users, are selected to form the top level set of choices presented to a user of a system once the database 112 has been built. These headings are eventually stored in the headings table 400 of the database.

Each heading has a list of categories (denoted by the syntax “=>”). These categories are presented to a user of the system in response to one of the headings being selected (in this example, only three categories for each heading are shown). The categories are stored in the category tables 410 in the database. Each category has a plurality of associated keywords, separated by commas. The keywords are stored in the URL tables 420 in the database. Each keyword has links to its URLs in the relevant URL table 420 and title and description data in the relevant title and description table 430.

As shown in the example, some categories are followed by further words or phrases preceded by a “%%%” syntax. This syntax is translated as “See Also”. See also is a useful and commonly used technique which makes a user aware that further information on a certain topic can be found elsewhere. Using the “films” category as an example, the see also option refers to “hobbies & sports: cinema” which means see also the “cinema” category under the “hobbies & sports” heading. The see also entries can be stored in separate database tables, which have not been shown, and are referenced by the categories in the categories tables 410.

A further feature of the category list apparent from the example illustrated above is that each keyword has associated with it two numbers. The first number is the “sense” or meaning of the word. For example, “slide” can be a noun or a verb and as a noun “slide” can mean a photographic slide or a slippery sloping surface in a park for children to slide down. The sense is determined by reference to a dictionary source, and most preferably a dictionary source available in computer-readable format. Obviously, it is critical that the same dictionary source is used throughout the processing stages to ensure that the sense label for a word remains consistent.

The second number relates to the weighting of the keyword, or how relevant the keyword is to its associated category. For example, the weighting for “camera”, under the category “photography”, is 5 (where 5 represents the highest relevance) whereas the weighting for “flash” is only 1 (representing the lowest relevance). The influence these numbers have over the database creation is described in more detail below.
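The category list syntax described above (“>>>>>” for headings, “=>” for categories, “%%%” for see also, and keyword:sense:weight triples in parentheses) could be read by a small parser along the following lines. This is a sketch only: the function name `parse_category_list`, the returned dictionary layout, and the assumptions that each keyword group sits on one line and follows a category line are all illustrative.

```python
def parse_category_list(text):
    """Parse category-list text into {heading: {category: {...}}}.

    Each category entry holds its optional see-also target and a list
    of (keyword, sense, weight) tuples. Well-formed input is assumed.
    """
    headings = {}
    heading = category = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">>>>>"):                 # heading level
            heading = line[5:].strip()
            headings[heading] = {}
        elif line.startswith("=>"):                  # category level
            name, _, see_also = line[2:].partition("%%%")
            category = name.strip()
            headings.setdefault(heading, {})[category] = {
                "see_also": see_also.strip() or None,
                "keywords": [],
            }
        elif line.startswith("(") and line.endswith(")"):
            for item in line[1:-1].split(","):       # keyword:sense:weight
                word, sense, weight = item.strip().rsplit(":", 2)
                headings[heading][category]["keywords"].append(
                    (word, int(sense), int(weight)))
    return headings
```

Splitting each item from the right keeps multi-word keywords such as “country dance” intact.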

The category list itself can be defined by any method, or in any way. The easiest way to build the file is by hand, with a user (or the creator of the database) entering all keywords, for example by trial and error, to build the required database. However, more efficient methods of building the list are envisaged. For example, adaptive software could be provided for which a user is required to provide (or select from the WWW, for example) a number of pages of information relating to a specific subject in which the user is interested. On the basis of the pages provided, the software would determine commonly occurring words or phrases which are indicative of the area of interest. For example, if the user is interested in photography and provides three pages on the subject of photography, the software may well be able to highlight words such as “camera”, “projector”, “photo”, etc., as being indicative of photography. In this way, for every subject of interest to the user, sample pages would be provided, keywords would be generated automatically and a category list would then be generated automatically. Alternatively, a computer-based thesaurus could be used to generate keywords for a single input category. Such a thesaurus could also be adaptive to include or discard certain words associated with certain categories (or vary a weighting associated with a word) on the basis of documents which were, at a later time, discarded from the database.

At the extreme, software could be provided to determine from, for example, the last x pages of information accessed on the Internet by a user, the main subject interest areas for that user on the basis of which subject areas most commonly arise. Then, the actual category list as well as the keyword list could be defined by the software and the ultimate database built without any human intervention at any stage.

Other methods of simplifying category list creation will become apparent to the skilled person on reading the present description.

Thus, the present embodiment allows the required database to be created automatically, apart from supplying the category list and, possibly, the keywords.

The way in which search engines are used to generate URL lists is now described. Search engines are commonly used in association with the WWW. A search engine typically carries out a search in response to an http command. Essentially, such a command comprises a URL for the search engine and a fixed syntax specifying the search to be carried out by the search engine at that URL. Typically, the syntax for the same command for different search engines varies and needs to be determined before successful searching in this manner can be achieved on different search engines. The following http command example tells the search engine “AltaVista” to perform a search on the keyword “slide”:

(AltaVista's URL)/cgi-bin/query?pg=q&what=web&fmt=.&q=slide
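As a hedged illustration of how a keyword might be placed into such a command, the sketch below substitutes a keyword into a per-engine syntax string. The `{keyword}` placeholder convention, the function name `build_search_command` and the host `search.example` are assumptions; only the query-string shape is taken from the example above.

```python
from urllib.parse import quote_plus

def build_search_command(syntax, keyword):
    """Insert a URL-encoded keyword into a search engine's syntax string.

    syntax is a template containing the (hypothetical) placeholder
    "{keyword}" at the position where the search term belongs.
    """
    return syntax.replace("{keyword}", quote_plus(keyword))

# Example template modelled on the command shown above; the host is a placeholder.
EXAMPLE_SYNTAX = "http://search.example/cgi-bin/query?pg=q&what=web&fmt=.&q={keyword}"
```

Encoding the keyword matters for multi-word entries in the category list, such as “sound track”, which cannot be sent with a literal space.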

Obviously, it would be onerous to have to learn the correct syntax for each search engine, and search engine providers overcome this need by providing ‘user-friendly’ front end GUIs which are executed on a user's local computer system. The GUI allows a user to type a keyword in, for example, an appropriate box on a display screen and submit the search request by pressing a “submit” button (by positioning a mouse pointer over the button on the display screen and clicking the left-hand mouse button). The GUI takes the greatly-simplified user input, bundles it into the more complex form, similar to that shown above, and transmits it to the search engine.

In response to this, the search engine searches its database to find the relevant keyword and returns data including URLs for pages containing the keyword. Search results typically also include, for each page found, a title and a brief summary of the contents of the page. From the displayed search results, a user is able to click on any of the URLs in response to which the search engine retrieves the page indicated by the URL.

In general, the returned data is typically in the form of an html (Hypertext Markup Language) page of information. The page comprises unformatted text and respective html codes which define how the text should be displayed and how a user can interact with it within a GUI environment. Typically, within the unformatted text, headings, titles, body text are distinguished using different html codes. The GUI takes the html page and interprets the codes for the page to be presented to a user as a suitably-formatted, interactive graphical display of the information.

Since html is an industry-wide standard, it is a relatively simple task to read the bare, unformatted text and html codes and interpret which text relates to descriptions of, for example, WWW pages and which text relates to URLs etc. This is exactly what is done by the present embodiment, as described below.

With reference to FIG. 5, the process for controlling the search engines is as follows. In step 510, the search engine to be accessed is selected from a list of available search engines held in a file in memory area 225. Then, the syntax for the selected search engine is read from a separate text file in memory area 230, in step 520, for use in forming the search requests for this search engine. In step 530, the keyword list is accessed and the first keyword read from the list. The keyword is incorporated into an http command, in step 540, using the appropriate syntax. The command is then transmitted to the search engine in step 550. The process then awaits the search results until, in step 560, the results are received. The results are then stored in memory area 235 in a text file in step 570, where a new text file is used for each search engine/keyword combination.

The process follows return branch B unless all keywords have been searched and then follows return branch A (beginning again from the top of the keyword list) to select the next search engine on the list, unless all search engines have been accessed. The result of the process is a set of text files containing all the search results which are stored in memory area 235 for future reference.

The process shown in FIG. 5 (and each of the following processes) is enacted by a software routine or batch file which is run on the PP 100, for example overnight, when communications costs are minimal.

The process described with reference to FIG. 5 is also able to react when a search engine or keyword request to a search engine does not respond, by moving on to the next search engine in the list, or the next keyword, whilst recording the failure of the search engine or request thereto to an error log (the error recording step 308 is shown in FIG. 3).
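The nested loop of FIG. 5, including the error logging, might be sketched as follows. The function name `run_searches`, the `{keyword}` placeholder in each syntax string, the in-memory result dictionary (standing in for the individual text files of memory area 235) and the injected `fetch` callable are all assumptions; this simplified version always moves on to the next keyword after a failure.

```python
from urllib.parse import quote_plus

def run_searches(engines, keywords, fetch):
    """Send every keyword to every search engine and collect the results.

    engines maps an engine name to its syntax template; fetch is any
    callable taking a URL and returning the result page as text, raising
    OSError when the engine does not respond. Returns (results, errors),
    with results keyed by (engine, keyword).
    """
    results, errors = {}, []
    for engine, syntax in engines.items():           # steps 510 and 520
        for keyword in keywords:                     # step 530
            command = syntax.replace("{keyword}", quote_plus(keyword))  # step 540
            try:
                results[(engine, keyword)] = fetch(command)  # steps 550 to 570
            except OSError as exc:
                # Record the failure and carry on (compare step 308).
                errors.append((engine, keyword, str(exc)))
    return results, errors
```

Passing the transport in as `fetch` keeps the sketch independent of any particular http library and lets the loop be exercised without network access.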

FIG. 6 illustrates the steps carried out in relation to the searchresult text files. For convenience, only one text file, representing onesearch engine/keyword combination, is considered.

In FIG. 6, the search results text file is opened in step 605 to be readsequentially, character by character. From the start of the file, andwhile the end of the file has not been reached (step 610), a singlecharacter is read from the file in step 615. In step 620, if thecharacter is not a < character, then branch A is followed back to step610 and a character counter (not shown), which dictates where charactersin the file are read from, is incremented. The < character indicates inhtml that the following text (or part thereof) is an html code asopposed to normal text. In particular, URLs are identified by the htmllabel a_href after a < character.

If the character is a <, the process goes to step 625. In step 625, thenext eight characters are read from the file. These next eightcharacters are examined in a step 630. If these next eight charactersare a_href=“, then all following characters are read until the next ″character in step 635. If the next eight characters are not a_href=″,then branch B is followed from step 630 back to step 610 with thecharacter counter being incremented by one. Further tests (which are notdescribed in detail) determine whether the retrieved characters relateto URLs or to HyperText links, where HyperText links are not requiredfor the present purposes.

After step 635, in step 640, the text between the first and second " characters, which represents a URL, is stored in a text file in memory area 240. The character counter is incremented by however many characters were read between steps 610 and 635.

Then, branch C is followed unless the whole text file has been read. If the whole file has been read, the file is closed and the process ends for the search engine/keyword combination under consideration.

The process shown in FIG. 6 is performed on the text file for each search engine/keyword combination. As a result, there is obtained an individual list of URLs extracted from the search results for each search engine. These lists are stored in memory area 240. After storing these lists of URLs in memory area 240, the overall process shown in FIG. 3 continues with step 312.
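The character scan of FIG. 6 might, purely as a sketch, be expressed as below. It is assumed here that the underscore in the label a_href stands for a space, so that the eight characters tested are a, space, h, r, e, f, = and a double quote. The name extract_urls and the case-insensitive comparison are additions made for illustration, and the further tests distinguishing URLs from HyperText links are not reproduced.

```python
from typing import List

# Assumption: the eight characters expected immediately after '<'.
MARKER = 'a href="'


def extract_urls(text: str) -> List[str]:
    """Scan character by character and collect quoted link targets."""
    urls: List[str] = []
    pos = 0  # the character counter
    while pos < len(text):
        if text[pos] != "<":
            pos += 1  # branch A: not an html code, read the next character
            continue
        candidate = text[pos + 1 : pos + 1 + len(MARKER)]
        if candidate.lower() != MARKER:
            pos += 1  # branch B: some other html code
            continue
        start = pos + 1 + len(MARKER)
        end = text.find('"', start)
        if end == -1:
            break  # no closing quote before the end of the file
        urls.append(text[start:end])
        pos = end + 1  # branch C: resume after the closing quote
    return urls
```

The returned list corresponds to the list of URLs stored for one search engine/keyword combination.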

The method of scoring retrieved pages will now be described with reference to FIG. 7. Scoring is carried out after all documents for a particular category, a document set, have been retrieved since document scores relate in part to the content of the other documents in the set. In step 710, all the documents in the document set are pre-processed. Pre-processing requires several steps. Firstly, each page is split into title, headings, and the body of the text. With respect to the body of the text, that is to say the actual descriptive text of the page, the amount of text to be processed is fixed to limit the amount of overall processing. For example, only the first 20 lines may be used for scoring. Next, meaningless words and terms are removed to make scoring more efficient. For example, words such as “and”, “the”, “but”, “a”, “however”, “since”, etc., are all removed since they add nothing to the information content of a page. The next step in the pre-processing is to convert all words to root form, for example by making all nouns singular, converting all adverbs back to adjectives and converting all verbs to their infinitive form.

Once pre-processing is complete, the documents in the set are more readily comparable and scores assigned thereto have more meaning.
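A minimal sketch of the pre-processing of the body of the text is given below for illustration. The stop-word set contains only the example words given above, the 20-line limit is the example value mentioned, and the to_root function is a deliberately crude stand-in (it merely strips a plural s) for whatever root-form conversion is actually employed. All names are hypothetical, and the splitting of a page into title, headings and body is not shown.

```python
from typing import List

# Only the example words from the description; a real list would be longer.
STOP_WORDS = {"and", "the", "but", "a", "however", "since"}
MAX_BODY_LINES = 20  # example limit on the amount of body text scored


def to_root(word: str) -> str:
    """Crude placeholder for root-form conversion: strip a plural 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def preprocess_body(body: str) -> List[str]:
    """Truncate the body, drop stop words and reduce words to root form."""
    words: List[str] = []
    for line in body.splitlines()[:MAX_BODY_LINES]:
        for raw in line.lower().split():
            word = raw.strip(".,;:!?\"'()")  # remove surrounding punctuation
            if word and word not in STOP_WORDS:
                words.append(to_root(word))
    return words
```

The resulting word lists are what the counts used in scoring would be taken from.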

In step 720, a constant called “totalwords” is calculated, which is the total number of words which remain in the set of documents after pre-processing. Then, in step 730, the number of occurrences “wordcount” of each keyword in the category is calculated for all documents in the set.

The remainder of the procedure (steps 740, 750, 760, 770) is carried out on each document in turn. Step 740 marks the beginning of the procedure which is carried out on each document.

In step 750, Algorithm A is used to operate on the body of the text of the document for each keyword in the category to produce an individual score for each keyword. Algorithm A is as follows:

$$\text{score} = \frac{\text{weight} \times \text{count} \times \text{totalwords} \times \text{order}}{\text{words in item} \times \text{wordcount}} \qquad \text{(Algorithm A)}$$

weight: relevance attached to the word within a set of keywords for the category, as described above.

count: number of times the word appears in the page.

totalwords: number of words in the set of pages, as described above.

words in item: number of words in the page.

wordcount: number of times the word appears in the set of pages.

order: number of single words in keyword.
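For illustration, Algorithm A might be computed as in the following sketch. The names count_keyword and score_body are hypothetical, pages are assumed to be held as lists of pre-processed words, and the return of a zero score when the page is empty or the keyword never occurs in the set is an assumption made here to avoid division by zero.

```python
from typing import List


def count_keyword(words: List[str], keyword: str) -> int:
    """Count occurrences of a (possibly multi-word) keyword in a word list."""
    parts = keyword.split()
    n = len(parts)
    return sum(1 for i in range(len(words) - n + 1) if words[i : i + n] == parts)


def score_body(
    weight: float,
    page_words: List[str],
    keyword: str,
    totalwords: int,
    wordcount: int,
) -> float:
    """Algorithm A for one keyword against the body text of one page."""
    if not page_words or wordcount == 0:
        return 0.0  # assumption: nothing to score against
    count = count_keyword(page_words, keyword)  # occurrences in this page
    order = len(keyword.split())  # number of single words in the keyword
    words_in_item = len(page_words)
    return (weight * count * totalwords * order) / (words_in_item * wordcount)
```

For example, with a weight of 2, a four-word page containing the keyword twice, a document set of ten words and three occurrences of the keyword in the set, the score is (2 × 2 × 10 × 1) / (4 × 3), or roughly 3.33.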

In step 760, Algorithm B is used to score the titles and headings for each keyword in the category to produce a set of individual scores for each keyword. Algorithm B is as follows:

Algorithm B

score (title)=100×weight×count×order

score (heading 1)=100×weight×count×order

score (heading 2)=50×weight×count×order

score (heading 3)=20×weight×count×order

etc

weight: relevance attached to the word within a set of keywords for the category, as described above.

count: number of times the keyword appears in the respective title or heading.

order: number of single words in keyword.

As can be seen, the title is given the main importance, along with the first main heading. Subsequent sub-headings are given respectively lower scores since they are typically less relevant.
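Algorithm B might be sketched as follows, again for illustration only. The multiplier table holds just the four values stated above; multipliers for deeper heading levels are not given in the description, so sections with no stated multiplier are simply skipped in this sketch. The name score_headings and the representation of each title or heading as a list of pre-processed words are assumptions.

```python
from typing import Dict, List

# Only the multipliers stated in Algorithm B; deeper levels are not specified.
MULTIPLIERS: Dict[str, int] = {
    "title": 100,
    "heading 1": 100,
    "heading 2": 50,
    "heading 3": 20,
}


def score_headings(
    weight: float, keyword: str, sections: Dict[str, List[str]]
) -> Dict[str, float]:
    """Algorithm B for one keyword against the title and headings of a page."""
    parts = keyword.split()
    order = len(parts)  # number of single words in the keyword
    scores: Dict[str, float] = {}
    for name, words in sections.items():
        multiplier = MULTIPLIERS.get(name)
        if multiplier is None:
            continue  # assumption: ignore levels with no stated multiplier
        # Occurrences of the keyword within this title or heading.
        count = sum(
            1 for i in range(len(words) - order + 1) if words[i : i + order] == parts
        )
        scores[name] = multiplier * weight * count * order
    return scores
```

A two-word keyword appearing once in the title with a weight of 1 thus contributes 100 × 1 × 1 × 2 = 200 to the score for the page.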

Once a page has been pre-processed and scored as described, the total score for the page is obtained by summing, in step 770, all the individual keyword scores from Algorithms A and B. Then, as described above, if the score is above a certain threshold value for the category it was scored against, the page is deemed to be relevant to that category and is eventually included in the database. As already stated, methods other than thresholding may be employed to determine which documents are included in the database, for example the top n scoring documents may be used, or the top m percentile of the documents.

The threshold, in this case, is determined by taking a sample of scored pages and checking them individually to see how relevant they are to a category and using their scores to judge the threshold value for that category.
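The summation of step 770 and two of the selection methods mentioned above (thresholding and taking the top n scoring documents) might be sketched as follows. The function names and the use of a dictionary mapping each URL to its total score are hypothetical; the percentile variant is not shown.

```python
from typing import Dict, Iterable, List


def total_score(body_scores: Iterable[float], heading_scores: Iterable[float]) -> float:
    """Sum all individual keyword scores from Algorithms A and B."""
    return sum(body_scores) + sum(heading_scores)


def select_by_threshold(scores: Dict[str, float], threshold: float) -> List[str]:
    """Keep the pages whose total score is above the category threshold."""
    return [url for url, score in scores.items() if score > threshold]


def select_top_n(scores: Dict[str, float], n: int) -> List[str]:
    """Alternative selection: keep the n highest-scoring pages."""
    ranked = sorted(scores, key=lambda url: scores[url], reverse=True)
    return ranked[:n]
```

In either case the output is a list of URLs, which is the location information that would be stored in the database for the category concerned.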

Obviously, it is desirable to remove from the database irrelevant documents, including documents having keywords of the wrong sense for their allocated category. This can be done before scoring in order to reduce the amount of processing required for scoring. This step can be achieved manually by a human reading the page and making the decision. However, it is anticipated that natural language processing algorithms, which look inside sentences and understand sentence structure and semantics, will allow the sense of a word to be determined to a high degree of accuracy and will thus replace manual human intervention. Such techniques are widely reported and are becoming more efficient, and are beyond the scope of the present description. Suffice it to say that such techniques are preferably used in the present embodiment to determine the sense of keywords in the page. The determined sense is then compared with the required sense which is defined in the category list described above.

However, removal of irrelevant documents could be done at any point, for example even after the database is formed, by deleting URL entries.

It has been shown, however, that the scoring process described above itself removes such irrelevant documents naturally by allocating them a low score which is below the required threshold. For example, if documents relating to photographic slides were required and some (irrelevant) documents relating to children's slides were retrieved, the scoring process (looking for other relevant keywords such as “camera”, “photograph” and “flash”, etc.) would very likely not find the other keywords and would accord the children's slide documents correspondingly low scores, and so such documents would be filtered out naturally.

Although in the system described above the distributed source database takes the form of information servers connected to the Internet, the system of the invention is suitable for use with other distributed source databases.

What is claimed is:
 1. An apparatus for populating a destination database, said apparatus comprising: a destination database; means for connecting the apparatus to a distributed source database; a memory area for containing a set of groups of keywords, each group of keywords being related to a predetermined subject category; means for controlling at least one search engine associated with said distributed source database, on the basis of said groups of keywords contained in said memory area, to provide search results including information relating to the location of documents containing said keywords, said documents being stored in said distributed source database; means for scoring each of the documents identified in search results provided by said at least one search engine on the basis of the respective contents thereof in accordance with predetermined criteria; means for selecting at least some of the documents scored by said scoring means, on the basis of their respective scores; and means for storing, for each document selected by said selecting means, in said destination database information relating to the location of the document in said distributed source database and its predetermined subject category.
 2. An apparatus as in claim 1, further comprising: means for identifying locating information contained in search results provided by said at least one search engine and storing said location information in a second memory area.
 3. An apparatus as in claim 1, further comprising: means for retrieving documents identified in search results provided by said at least one search engine and storing said retrieved document in a document store.
 4. An apparatus for populating a destination database, said apparatus comprising: a destination database; a memory area for containing a set of groups of keywords, each group of keywords being related to a predetermined subject category; and a processing platform which can access said destination database and said memory area, which is connectable to a distributed source database, and which is arranged to: control at least one search engine associated with said distributed database, on the basis of said groups of keywords contained in said memory area, to provide search results including information relating to the location of documents containing said keywords, said document being stored in said distributed source database; score each of the documents identified in said search results on the basis of the respective contents thereof in accordance with predetermined criteria; select at least some of the documents on the basis of their respective scores; and store, for each selected document, in said destination database information relating to the location of the document in said distributed source database and its predetermined subject category.
 5. An apparatus for populating a destination database, said apparatus comprising: a) means for connecting the apparatus to a distributed source database; b) a first memory area for storing a set of groups of keywords each group of keywords being associated with a pre-determined subject category; c) means for reading a keyword from the first memory area and transmitting said keyword to search means, said search means having access to the distributed source database; d) means for receiving search results from said search means and storing said results in a second memory area, said results including information relating to the location of documents stored in said source database containing said keyword; e) means for identifying and storing said location information in a third memory area; f) means for reading location information from the third memory area and transmitting a request to the source database to return a copy of a document associated with selected location information to the apparatus; g) means for receiving and storing said copy of said document in a fourth memory area; h) means for accessing and scoring each of the documents stored in the fourth memory area on the basis of the respective contents thereof in accordance with pre-determined criteria; and i) means for selecting at least some of said documents on the basis of the respective scores and, for each selected document, storing in a database information relating to the location of the document in the source database and its predetermined subject category.
 6. A method of populating a destination database, said method comprising: controlling at least one search engine associated with a distributed source database, on the basis of a set of groups of keywords, each group of keywords being related to a predetermined subject category, to provide search results including information relating to the location of documents containing said keywords, said documents being stored in said distributed source database; scoring each of the documents identified in said search results on the basis of the respective contents thereof in accordance with predetermined criteria; selecting at least some of the documents on the basis of their respective scores; and storing, for each selected document, in said destination database information relating to the location of the document in said distributed source database and its predetermined subject category.